
feat(spark): Add compute-on-read support for BatchFeatureView in get_…#6357

Merged
ntkathole merged 7 commits into feast-dev:master from SIDDHESH1564:feature/spark-bfv-compute-on-read
May 3, 2026

Conversation

@SIDDHESH1564
Contributor

@SIDDHESH1564 SIDDHESH1564 commented May 1, 2026

What this PR does / why we need it:

When a BatchFeatureView (BFV) is defined with @batch_feature_view and TransformationMode.PYTHON in the Spark offline store, get_historical_features() fails with UNRESOLVED_COLUMN errors. The point-in-time (PIT) join SQL reads directly from the raw batch_source and expects the transformed feature columns (e.g., aggregated outputs) to already exist in the source data. However, BFV transformations are executed only during feast materialize, not during offline retrieval.

This PR introduces compute-on-read support for BatchFeatureView in SparkOfflineStore. Before generating the PIT join SQL, BFVs with a UDF are detected and their transformations are applied:

  1. Read the raw source into a Spark DataFrame
  2. Invoke the BFV's feature_transformation.udf() (same function used during materialization)
  3. Register the transformed DataFrame as a Spark temporary view
  4. Replace the table_subquery in the query context with the temp view name

This enables reuse of BFV definitions during offline training without requiring pre-materialization or external ETL pipelines. The entire pipeline remains fully distributed in Spark.
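The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: QueryContext and ViewSpec are hypothetical stand-ins for Feast's query context and feature view objects, apply_bfv_transformations and the temp view naming scheme are assumptions, and only spark.sql() and DataFrame.createOrReplaceTempView() are real PySpark calls.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class QueryContext:
    """Hypothetical stand-in for the per-feature-view query context."""
    name: str
    table_subquery: str
    timestamp_field: str


@dataclass
class ViewSpec:
    """Hypothetical stand-in for a feature view; udf is None for plain views."""
    name: str
    udf: Optional[Callable[[Any], Any]] = None


def apply_bfv_transformations(
    spark: Any, views: List[ViewSpec], contexts: List[QueryContext]
) -> List[QueryContext]:
    """Return contexts whose table_subquery points at a transformed temp view."""
    views_by_name = {view.name: view for view in views}
    result = []
    for ctx in contexts:
        view = views_by_name.get(ctx.name)
        if view is None or view.udf is None:
            # Plain views and BFVs without a UDF pass through unchanged.
            result.append(ctx)
            continue
        # 1. Read the raw source into a Spark DataFrame.
        source_df = spark.sql(f"SELECT * FROM {ctx.table_subquery}")
        # 2. Invoke the same UDF that materialization would use.
        transformed_df = view.udf(source_df)
        # 3. Register the transformed DataFrame as a temporary view
        #    (the naming scheme here is an assumption).
        view_name = f"feast_bfv_{ctx.name}_{uuid.uuid4().hex[:8]}"
        transformed_df.createOrReplaceTempView(view_name)
        # 4. Point the PIT join at the temp view; other fields are preserved.
        result.append(replace(ctx, table_subquery=view_name))
    return result
```

Because the UDF receives and returns Spark DataFrames and the result is exposed only as a temp view, nothing in this sketch collects data to the driver; evaluation stays lazy until the PIT join query runs.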

Which issue(s) this PR fixes:

Fixes #6345

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Manual tests
  • Testing is not required for this change

Added 7 unit tests covering:

  • BFV with UDF → table_subquery replaced with temp view
  • UDF invoked with source DataFrame
  • Transformed DataFrame registered as temp view
  • Plain FeatureView passes through unchanged
  • BFV without UDF passes through unchanged
  • Mixed BFV + plain FeatureView scenarios
  • All non-transformation context fields preserved
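Cases like these can be checked without a running Spark cluster by mocking the session. The sketch below shows the pattern for two of the listed cases; register_transformed_view is a hypothetical helper written for this illustration and is not the function added in this PR, while spark.table() and createOrReplaceTempView() are real PySpark calls.

```python
from unittest.mock import MagicMock


def register_transformed_view(spark, udf, source_table, view_name):
    """Hypothetical helper under test: transform a source and expose it as a temp view."""
    source_df = spark.table(source_table)
    transformed_df = udf(source_df)
    transformed_df.createOrReplaceTempView(view_name)
    return view_name


def test_udf_invoked_with_source_dataframe():
    spark = MagicMock()
    udf = MagicMock()
    register_transformed_view(spark, udf, "raw_events", "bfv_tmp")
    spark.table.assert_called_once_with("raw_events")
    # The UDF must receive exactly the DataFrame read from the source.
    udf.assert_called_once_with(spark.table.return_value)


def test_transformed_dataframe_registered_as_temp_view():
    spark = MagicMock()
    udf = MagicMock()
    name = register_transformed_view(spark, udf, "raw_events", "bfv_tmp")
    # The UDF's output, not the raw source, is what gets registered.
    udf.return_value.createOrReplaceTempView.assert_called_once_with("bfv_tmp")
    assert name == "bfv_tmp"
```

MagicMock records every call, so the assertions verify the wiring (which DataFrame reaches the UDF, which one is registered) rather than any Spark computation.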

Misc

Changes:

  • sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py:
    • Added BatchFeatureView import
    • Added _apply_bfv_transformations() helper function
    • Integrated call into get_historical_features() between query context construction and PIT join SQL generation
  • sdk/python/tests/unit/infra/offline_stores/contrib/spark_offline_store/test_spark_bfv_compute_on_read.py (new):
    • 7 unit tests for compute-on-read behavior

@SIDDHESH1564 SIDDHESH1564 requested a review from a team as a code owner May 1, 2026 19:12
@franciscojavierarceo
Member

@copilot can you apply make format-python on this PR?

@franciscojavierarceo franciscojavierarceo changed the title from "feat(spark): add compute-on-read support for BatchFeatureView in get_…" to "feat(spark): Add compute-on-read support for BatchFeatureView in get_…" May 1, 2026
…historical_features

Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>
@SIDDHESH1564 SIDDHESH1564 force-pushed the feature/spark-bfv-compute-on-read branch from fcdf0e6 to 11d69be May 2, 2026 04:28
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
…n logic

Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>
@SIDDHESH1564 SIDDHESH1564 requested a review from ntkathole May 2, 2026 16:52
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
…V source resolution

Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>
@ntkathole ntkathole merged commit 630d9f8 into feast-dev:master May 3, 2026
25 of 26 checks passed
franciscojavierarceo pushed a commit that referenced this pull request May 4, 2026
# [0.63.0](v0.62.0...v0.63.0) (2026-05-04)

### Bug Fixes

* Add project filter to apply_data_source and delete_data_source (closes [#6206](#6206)) ([#6322](#6322)) ([96562c4](96562c4))
* Add project_id filter to SnowflakeRegistry UPDATE path ([#6243](#6243)) ([6658b71](6658b71)), closes [#6208](#6208) [#6208](#6208)
* Add subprocess timeouts to prevent test_e2e_local hanging on Dask atexit handler ([3de6556](3de6556))
* Ambiguous truth value of array during materialization ([#6259](#6259)) ([d0c8984](d0c8984))
* Auto-detect GCS/S3 registry store when registry is passed as string ([#6260](#6260)) ([7ebcf03](7ebcf03))
* **bigquery:** Prefer query over table in get_table_query_string ([#6360](#6360)) ([77ed779](77ed779)), closes [#6200](#6200)
* correct project_id scoping in get_user_metadata and delete_project ([0c469a7](0c469a7))
* disable Redis RDB persistence in test deployments ([44cd682](44cd682))
* Disable snowflake tests temporarily in CI ([#6356](#6356)) ([31d5a98](31d5a98))
* Filter empty SQL commands at execute_snowflake_statement call sites ([#6249](#6249)) ([92ffbb9](92ffbb9))
* Fix five bugs in milvus online store ([#6275](#6275)) ([212504b](212504b))
* Fix issue with apply feature view ([835cda8](835cda8))
* Fix streaming materialization for exotic sources with lazy UDF pipelines ([c07972d](c07972d))
* Handle missing features gracefully instead of panicking ([7d00b3a](7d00b3a))
* Harden informer cache with label selectors and memory optimizations ([#6242](#6242)) ([3f11356](3f11356))
* **helm:** Avoid nil pointer for metrics.enabled inside podAnnotations ([#6251](#6251)) ([c833f1a](c833f1a))
* Include git in feast server image ([fb03c46](fb03c46))
* Include StreamFeatureView in freshness metric ([#6269](#6269)) ([463f16c](463f16c))
* Pre-create S3A event log dir before SparkContext init ([#6317](#6317)) ([9feca77](9feca77))
* Remote Online Store Type Inference Error with All-NULL Columns ([#6063](#6063)) ([de67bdd](de67bdd))
* Remove selector with kustomize overlay using a JSON 6902 patch ([9107a43](9107a43))
* Resolve multiple bugs in SnowflakeRegistry and Snowflake connection handling ([#6315](#6315)) ([7e66a2e](7e66a2e))
* **spark:** BatchFeatureView with TransformationMode.PYTHON now reads all source columns ([a310eaf](a310eaf))
* **spark:** Use SELECT * when feature_name_columns is empty in pull_all_from_table_or_query ([e1b1d2d](e1b1d2d))
* Support pandas mode in feature builder and fix dask column extraction ([863315e](863315e))
* support SQL string as entity_df in RemoteOfflineStore.get_historical_features ([c559889](c559889))
* Wrap LocalOutputNode return value in ArrowTableValue for consist… ([#6286](#6286)) ([a16cd55](a16cd55))

### Features

* Add agent skills and Cursor/Claude rules for Feast development ([312eea3](312eea3))
* Add feature view versioning support to FAISS online store ([b36acb7](b36acb7))
* Add feature view versioning support to Redis and DynamoDB online stores ([#6257](#6257)) ([edf25af](edf25af)), closes [#6164](#6164) [#6163](#6163)
* Add optional 'org' in feature view ([#6288](#6288)) ([#6301](#6301)) ([608b105](608b105))
* Add RaySource, to_ray_dataset first-class method, docs, and tests ([1c98157](1c98157))
* Add TLS support for Go Feature Server ([#6229](#6229)) ([28a58d0](28a58d0))
* Add Vector Search support to MongoDBOnlineStore ([#6344](#6344)) ([c102738](c102738))
* Add versioning support to Milvus online store ([#6330](#6330)) ([3268ced](3268ced))
* Addresses performance issues in the Redis online store ([2e50da0](2e50da0))
* Allow to set gpu for ray ([5580ab4](5580ab4))
* Bump redis-py version cap from <5 to <8 ([#6339](#6339)) ([9538180](9538180))
* Expose feature_server, materialization, and openlineage configuration via FeatureStore CRD ([ec6ecfd](ec6ecfd))
* Make online_write_batch_size configurable in MaterializationConfig ([#6268](#6268)) ([d41becf](d41becf))
* Make udf optional if agg defined ([#5689](#5689)) ([#6328](#6328)) ([f630056](f630056))
* MongoDB offline store ([#6138](#6138)) ([8eebad7](8eebad7))
* Optional input_schema for ODFV ([#6308](#6308)) ([#6312](#6312)) ([f08b4e8](f08b4e8))
* Provision minimal TokenReview RBAC for OIDC auth and add SSL error logging in token parser ([#6240](#6240)) ([dca57e8](dca57e8))
* **spark:** Add compute-on-read support for BatchFeatureView in get_… ([#6357](#6357)) ([630d9f8](630d9f8))


Development

Successfully merging this pull request may close these issues.

BFV Compute-on-Read for get_historical_features() in SparkOfflineStore

3 participants